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ABSTRACT 

Motivation: The Expectation-Maximization (EM) algorithm has been 
successfully applied to the problem of transcription factor binding site 
(TFBS) motif discovery and underlies the most widely used motif dis- 
covery algorithms. In the wider field of probabilistic modelling, the 
stochastic EM (sEM) algorithm has been used to overcome some of 
the limitations of the EM algorithm; however, the application of sEM to 
motif discovery has not been fully explored. 

Results: We present MITSU (Motif discovery by ITerative Sampling 
and Updating), a novel algorithm for motif discovery, which combines 
sEM with an improved approximation to the likelihood function, which 
is unconstrained with regard to the distribution of motif occurrences 
within the input dataset. The algorithm is evaluated quantitatively on 
realistic synthetic data and several collections of characterized pro- 
karyotic TFBS motifs and shown to outperform EM and an alternative 
sEM-based algorithm, particularly in terms of site-level positive pre- 
dictive value. 

Availability and implementation: Java executable available for 
download at http://www.sourceforge.net/p/mitsu-motif/, supported 
on Linux/OS X. 

Contact: a.m.kilpatrick@sms.ed.ac.uk 
1 INTRODUCTION 

Transcription factor binding site (TFBS) motifs are short DNA 
sequence patterns that have important roles in genetic transcrip- 
tional regulation. These patterns are of considerable interest to 
biologists, as they are central to understanding the mechanisms 
of gene expression. The discovery and further analysis of TFBS 
motifs remains an important and challenging problem in bio- 
informatics [examples from the recent ENCODE project include 
Spivakov et al. (2012), Whitfield et al. (2012) and Yip et al. 
(2012),]; as a result, there is continued interest in developing al- 
gorithms for unsupervised discovery of TFBS motifs (Bailey 
et al, 2010). 

The majority of TFBS discovery algorithms are probabilistic 
algorithms, which search the input data (usually a collection of 
promoter regions of coregulated genes) for sequences that are 
statistically over-represented. Deterministic algorithms make up 
a large proportion of commonly used algorithms for motif dis- 
covery. The deterministic Expectation-Maximization (EM) algo- 
rithm is one of the earliest probabilistic motif discovery 
algorithms (Lawrence and Reilly, 1990) and is the basis for a 
number of others, including the benchmark motif discovery al- 
gorithm MEME (Bailey and Elkan, 1994). However, the EM 
algorithm has several well-known limitations. For example, the 
EM algorithm is highly sensitive to its starting parameters. 
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Owing to this sensitivity and the use of a local search strategy, 
the EM algorithm cannot be guaranteed to converge to the 
global maximum of the likelihood function, instead converging 
to an insignificant local maximum or saddle point of the likeli- 
hood function. In general, the steps of the EM algorithm can 
become either analytically or computationally intractable in 
many practical situations. 

The stochastic EM (sEM) algorithm is motivated by the limi- 
tations of the deterministic EM algorithm, particularly the issues 
of intractability. Celeux et al. (1995) note that the sEM algorithm 
is generally more successful than the EM algorithm owing to 
stochastic perturbations, which allow the sEM algorithm to 
escape stable fixed points of the EM algorithm such as insignifi- 
cant local maxima of the likehhood function. In addition to this, 
retaining the underlying EM dynamics means that the sEM al- 
gorithm generally converges in a relatively small number of iter- 
ations in comparison with full stochastic methods. 

Stochastic variants of the EM algorithm have been applied to 
motif discovery previously; for example, the SEAM (Bi, 2007) and 
MCEMDA (Bi, 2009) algorithms. However, the power of sEM in 
a motif discovery context has not been fully explored. Most not- 
ably, these algorithms are limited to the 'one occurrence per se- 
quence' (OOPS) model, which places a constraint on the 
distribution of motif occurrences within the input dataset. 
Further, algorithms based on stochastic variants of EM have so 
far not implemented features commonly found in other motif dis- 
covery algorithms, including the ability to automatically deter- 
mine the most likely motif width from a range of plausible values. 

In this article, we present MITSU (Motif discovery by 
ITerative Sampling and Updating), a novel algorithm for 
TFBS motif discovery that combines a stochastic version of 
the EM algorithm with a derived dataset, which leads to an im- 
proved approximation of the likelihood function. Significantly, 
this likelihood function is unconstrained with regard to the 
number of motif occurrences in each input sequence. The algo- 
rithm also incorporates MCOIN, an information-based heuristic 
to automatically determine the most likely motif width 
(Kilpatrick et al, 2013). MITSU is evaluated quantitatively on 
realistic synthetic data and several collections of previously char- 
acterized prokaryotic TFBS sequences and shown to outperform 
an EM-based algorithm and the SEAM algorithm, most notably 
in terms of site-level positive predictive value. The results of add- 
itional tests demonstrate that MITSU has significant advantages 
over current sEM-based approaches for motif discovery. 

2 APPROACH 

This article implements an approach based on sEM for the 
purpose of TFBS motif discovery. Given a joint distribution 
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p(X, Z\0) over observed variables X and latent variables Z, 
governed by parameters 0, the deterministic EM algorithm 
(Dempster et al., 1977) maximizes the likelihood function 
p(X\0) with respect to 0. This likelihood function is intractable 
directly, so two steps are iteratively applied until some conver- 
gence criteria are reached to maximize the likelihood function. 
An initial estimate of the parameters is made, then the E-step 
calculates the expected value of the log likelihood function, with 
respect to the distribution of Z conditional on X under the 
current estimate of the parameters 0^^^: 

Q(0,0^^^) = E^^^,,^[\np(X,Zm (1) 

In the context of motif discovery, this can be viewed as calculat- 
ing the probability for each width-w subsequence in the dataset 
that it is an occurrence of the motif, or equivalently estimating 
the position of occurrences of the motif within the input dataset. 
The M-step then evaluates a new estimate of the parameters by 
maximizing the expected value of the log likelihood function: 

5)(^+i)= argmaxQ(^,^^^)) (2) 

e 

In the context of motif discovery, this can be viewed as reesti- 
mating the model parameters given the current estimates for the 
motif position within the input dataset. 

Stochastic variations of the EM algorithm first use Monte 
Carlo methods to draw a set of samples {z^^\ . . . , z^'^^} from 
the current approximation to the conditional predictive distribu- 
tion p(Z\X, ^^^^), before replacing the integral in the E-step of the 
EM algorithm Equation (1) with a finite sum over the drawn 
samples. The modified E-step is thus 

i M 

Q,^i(e, (f'^) « - ^ \np{X, Z*'"*!^) (3) 

m = 1 

The M-Step then requires maximizing the Q function as before. 
This particular variation on the EM algorithm is known as the 
Monte Carlo EM (MCEM) algorithm (Wei and Tanner, 1990). 

Stochastic EM (Celeux et al, 1995) can be viewed as a special 
case of MCEM, where only one sample is drawn at each iter- 
ation. In this case, the latent variables Z characterize which one 
of the mixture components is responsible for each point in the 
dataset, effectively making a 'hard' assignment of data points to 
mixture components, rather than the probabilistic weightings 
used by the EM algorithm. In the context of motif discovery, 
this would assign each data point to either the motif model or the 
background model. Formally, the sampling step (S-step, analo- 
gous to the E-step in EM) of the sEM algorithm replaces the 
computation of the Q function in the E-step by the simpler com- 
putation of />(Z|X, 0^^^) and simulation of a 'pseudosample' z^^l 
The update step (U-step, analogous to the M-step in EM) up- 
dates the model parameters 0^^^ on the basis of the 'pseudo-com- 
plete sample' {X, z^^^}, in the same way as normal. 

As noted above, one of the reasons for stochastic variations of 
EM being generally more successful than EM is that they have 
the ability to avoid insignificant local maxima of the likelihood 
function. This is achieved by choosing whether to accept or reject 
the new set of proposed model parameters in the U-step of the 
algorithm. Through this accept/reject mechanism, there is a non- 
zero probability of accepting new model parameters with a lower 



likelihood than the current parameters at each iteration of the 
algorithm (Celeux et al, 1995). In contrast, deterministic EM is 
guaranteed not to decrease the likelihood and so may become 
trapped in local maxima or saddle points of the likelihood 
function. 

One significant limitation of the SEAM algorithm is that only 
the OOPS model is implemented. Bi (2007) suggests that the 
OOPS model may be extended to the two-component mixture 
(TCM) model (which is unconstrained with regard to the distri- 
bution of motif occurrences) by first discovering a motif using 
the OOPS model, then scanning the input sequences to discover 
further occurrences. However, this strategy may not be statistic- 
ally robust. In this article, we take an approach that extends the 
OOPS model naturally to the 'zero or one occurrences per 
sequence' (ZOOPS) model, based on the original model defin- 
itions. We then continue this extension to a model that allows an 
arbitrary number of motif occurrences in each input sequence, 
using a previously described cutting heuristic. 

3 METHODS 

3.1 A sEM density for the OOPS model 

The idea underlying existing algorithms for motif discovery, which im- 
plement stochastic variants of EM (Bi, 2007, 2009), is to replace the 
computation and maximization of Q{6, 6^^^) by the much simpler compu- 
tation of p{Zij = 1 \Xi, 0^^^), drawing a number of samples Z^^^ (S-step), 
followed by an update to 6 based on the pseudo-complete samples (X,Z^^'*) 
(U-step). A suitable density to represent an input sequence is required. 
We begin by confirming that the density used by Bi (2007) to represent an 
input sequence using the OOPS model is consistent with the OOPS model 
derived by Bailey and Elkan (1994). 

We generalize the expression introduced by Bailey and Elkan (1994) to 
define the expectation of the missing data for position j in sequence / 
using the OOPS model as follows: 

Z« = 1 IX,, ^(0) . ^^_X^|Z.; = 1.^^^^) 

'Y, K^/|Z,/=1,^«) 

where is defined as the length of input sequence /, and w is defined as 
the motif width. Although Bi (2007) uses slightly different notation, we 
confirm that the definition used is equivalent to that of Bailey and Elkan 
(1994). Defining k as the set of nucleotides, that is, k e {A, C, G, 7], the 
conditional probability of sequence / given the hidden variables is defined 
in both methods as follows: 

p(Xi\Zij = i,e)^ 

leAijk = A w=\k = A 

This may be viewed as the product of two terms: the first calculating the 
probability of the background positions and the second calculating the 
probability of the motif positions. 

Here, we generalize the expressions used by Bailey and Elkan (1994) to 
define the joint (log) likelihood function for the OOPS model as follows: 

\np{X,Z\e)^ 

J2 E Zij\np(Xi\Z,j=l,e)^mn- — 

^ ^ Lj — w+l 

z = 1 7 = 1 

Again, despite notational differences, this can be shown to be equivalent 
to the expression as defined by Bi (2007). 
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To define a suitable density to represent an input sequence, Bi (2007) 
substitutes Equation (5) into Equation (4); cancelling the 'background' 
terms and taking logs for efficiency results in the expression, 



4';=Kz,v=iix,-,e<")= 

1 r T 

l=\k = A 



*(/) 



/(Xy+,_i=fc)ln 



(7) 



where is a normalizing factor such that 

Li-w+\ 
7=1 

Discussion of the sEM S- and U-steps is deferred to the following section, 
where they are presented in the context of the ZOOPS model. 



3.2 Extending sEM to the ZOOPS model 

Here, we follow a similar method to derive an expression representing a 
sequence in the ZOOPS model. The ZOOPS model, introduced by Bailey 
and Elkan (1994), assumes that the input sequences contain 'zero or one 
occurrences per sequence'. The ZOOPS model requires an indicator vari- 
able that denotes whether a particular input sequence contains a motif oc- 
currence. Here, the indicator variable Qi is defined as Qi = ^ ^ Z, y. 
That is, = 1 if sequence / contains a motif occurrence and d otherwise. 
The conditional Hkelihood for a sequence containing a motif occurrence 
remains the same [Equation (5)]. The conditional likelihood for a sequence 
that does not contain a motif occurrence is now defined as follows: 



(8) 



Defining an additional variable y as the prior probability of a motif occur- 
ring in a sequence and assuming a uniform prior distribution for motif 
occurrences within a sequence, it follows that the prior probability of a 
position in sequence / being a motif start site is 



p(Zij=m- 



Li - w+ 1 



(9) 



For simplicity, the model parameters are now collected and denoted as 
<t> = {0, y). It is noted that the model parameters now include the prior prob- 
ability of a sequence containing a motif occurrence, in addition to the motif 
and background models from the OOPS model. It can be shown that the log 
likelihood function for the complete data in the ZOOPS model can be 
generalized as follows: 

N /Li-w+\ 



Y,{\-Qr)\np{Xi\Qi = 0, 6) 

i=\ 

N 



(10) 



Li - w+ 1 

+E(i-e,)in(i-y) 

1 = 1 

The expectation of the missing data for the ZOOPS model is therefore 

KA'/|Z,,,= l,0<'>)j;^ 



L,— H'+ 1 

K^fiz,/=i,^«) 

\ 1=1 



(11) 



Li - w+ 1 



It can be shown that substituting Equations (5) and (8) into Equation (11) 
as required, then cancelling terms yields 



zi1 = 



I 

L/-H'+l 



(L,-- w+l)(l -yW) + 



(12) 



nn(|f) 



our expression representing a sequence in the ZOOPS model. 

The S-step of the sEM algorithm is implemented as described previ- 
ously (Bi, 2007), drawing a sample jl from Equation (12) for each input 
sequence / g {1, A^. The U-step of the sEM algorithm requires the 
construction of a proposal model & based on the samples from the S- 
step. The parameter updates provided by Bi (2007) are altered here to 
account for the fact that not every sequence may contain a motif occur- 
rence. The expected values of the variables are used to weight the 
samples from each sequence i. Here we define the parameters of our 
proposal model as 



(13) 



for w G {1, . . . , PF} and k ^{A,C,G,T\. The parameters of the back- 
ground model are not updated, but could be reestimated if required. ^ = 
^^2k-A ^ vector of pseudocounts, equivalent to a Dirichlet prior 
distribution. We also require an update for the other parameter y. It 
can be shown that the proposal value for the fraction of sequences con- 
taining a motif occurrence is just that, based on the values of q\^^ calcu- 
lated in the S-step: 



(14) 



As in SEAM (Bi, 2007), the Metropolis algorithm is used to decide 
whether to keep our updated parameters. The energies of the current and 
proposal models, G{(j)^^^) and G(0'), respectively, are calculated (how this 
is done is described in Section 3.4) and the change in energy calculated: 



AG = G((/)('^)-G((/)0. 

The Metropolis ratio is defined as 

aM(0', 0^'^) = min {1, exp(-AG)} 

A random number u ~ Unif[0, 1] is drawn and the parameters updated to 
the proposal parameters only if u is less than or equal to the Metropolis 
ratio, that is. 



(15) 



(16) 



/)(?+!)- 



ifw<aM(0^0^'^), 

otherwise, 



for w G {1, . . . , PF} and ^ G M, C, G, T\ and 

/, ifw<aM(0',0^'^), 



Y 



otherwise. 



(17) 



(18) 



3.3 Removing the ZOOPS constraint 

The ZOOPS model still enforces constraints on the distribution of motif 
occurrences; it is assumed that each input sequence contains at most one 
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occurrence of a motif. However, there are many biological examples of 
promoter sequences that contain multiple copies of the same TFBS 
(Bembom et al., 2007). This is the primary motivation for the TCM 
model introduced by Bailey and Elkan (1994), which allows an arbitrary 
number of non-overlapping motif occurrences in each input sequence. 

The likelihood function for the TCM model is more computationally 
complex than those for the OOPS and ZOOPS models. As a result, exact 
methods based on the TCM model have been avoided in favour of more 
tractable approximations (Bembom et al., 2007). The TCM model pro- 
posed by Bailey and Elkan (1994) uses a derived dataset consisting of all 
overlapping subsequences of width w from the original dataset. Some 
proportion of these subsequences are motif occurrences; the remainder 
are background. While the subsequences in this derived dataset are ne- 
cessarily overlapping, the likelihood function is based on a sample of 
independent sequences (Bembom et al., 2007). An additional smoothing 
step is required to reduce the degree to which two overlapping subse- 
quences can both be assigned to the motif component of the model. 

Keles et al. (2003) propose an alternative cutting heuristic, which in- 
volves deriving a different dataset from the original, then applying the 
ZOOPS model to each of the derived sequences. The main advantages of 
this method are that no additional steps are required to deal with the 
assumption of independence, and the approximation to the likelihood 
function is improved. This method is improved by Bembom et al. 
(2007) and we implement a similar method here. Briefly, the original 
dataset is cut into subsequences of a given length C/, such that each sub- 
sequence contains the first (w 1) positions of the next subsequence. The 
ZOOPS model is then applied to this derived dataset. The previous stu- 
dies implementing this heuristic have shown that the method is fairly 
robust with respect to the choice of cut length U but have suggested 
that this parameter may be optimized using cross-validation (Bembom 
et al, 2007; Keles et al., 2003). Here, the cut heuristic is implemented as 
an inner loop within the motif discovery algorithm (Section 3.5). The 
ZOOPS model is applied to derived datasets with varying values of C/, 
and the parameter settings that yield the highest energy value are returned 
as the best motif model. We show in Section 4.3 that the cut heuristic in 
combination with the ZOOPS model successfully allows discovery of 
multiple copies of the same motif within a single input sequence, in the 
context of motif discovery using sEM. 



3.4 Defining an energy function 

The original energy function used in the SEAM algorithm (Bi, 2007) 
becomes problematic when used with the cut heuristic used to implement 
discovery of multiple motifs within a single input sequence. The main 
problem stems from the fact that the energy function 

(T w T \ 

k=A j=\k=A / 

is scaled by the number of input sequences A^; this is assumed to be 
constant in the SEAM algorithm and means that energies cannot be 
compared between datasets with differing values of A^. Using the cutting 
heuristic means that the value of may double, or triple, depending on 
the cut length (C/). A way of fairly comparing motif energies is required. 
We are further interested in the properties of the energy function, par- 
ticularly how it varies with changing motif conservation and varying 
values of y. Here, we propose a modification to the original energy func- 
tion such that 

I I T w T \ 

y ^ \k = A j= \ k = A J 

This modified energy function is maximized with a perfectly conserved 
motif occurring in each input sequence, and the yN factor cancels in the 



case of datasets derived by the cut heuristic. It can be shown that the 
following useful properties hold: 

(1) If two motifs are perfectly conserved, the motif with the higher 
number of occurrences will have a higher energy. 

(2) Given two motifs of equal prevalence and unequal motif conser- 
vation, the motif discovery algorithm will tend to discover the 
motif with the higher energy (equivalently, the higher motif 
conservation). 

(3) All else being equal, a higher proportion of sequences containing a 
motif occurrence will yield a higher energy. 

We adopt this modified energy function in MITSU but note that other 
alternative energy functions may be possible; because the sEM accept/ 
reject mechanism is based on a difference of energies, substituting other 
energy functions based on the model entropy should have little effect on 
this mechanism. 

3.5 MITSU algorithm 

The pseudocode of MITSU is given as follows: 



procedure MITSU algorithm 

create Markov background model 

for W = Wnnn to "^max dO 

for cut length in {set of cut lengths} do 
for n random seeds do 

for y = l/VNto 1 by x 2 do 

run sEM on cut dataset using ZOOPS model at width w: 
until convergence do 
S-step (Equation 12) 
U-step (Equations 13-18) 
end 
end 
end 

return the best motif model over n random seeds & varying y 
end 

return the best motif model over all cut lengths 
end 

estimate most likely width w using MCOIN 
return motif model and list of predicted sites for w 
end MITSU algorithm 



Although satisfactory convergence results for sEM and related algo- 
rithms have been obtained (Diebolt and Robert, 1990, 1994), designing a 
stopping rule for sEM is challenging; Jank (2005) notes that a simple 
deterministic stopping rule may be triggered by what is a chance fluctu- 
ation stemming from the S-step of the algorithm. Following the recom- 
mendations of Booth and Hobert (1999), we implement a deterministic 
stopping rule for several iterations to reduce the chance of a premature 
stop. After each iteration, the Euclidean distance between the previous 
and current motif models is calculated. If this distance is below a given 
threshold for three successive iterations, the algorithm is deemed to have 
converged; we choose the threshold here as 10^. Stochastic EM generally 
takes longer to converge than deterministic EM (on tests with the CRP 
dataset used in Section 4.3, deterministic EM was approximately five 
times faster than MITSU, based on testing 1000 random seeds). 
However, as noted above, sEM usually converges faster than full stochas- 
tic methods. We accept this longer running time in exchange for increased 
accuracy in terms of predicted motif occurrences. We compare the con- 
vergence of MITSU with that of deterministic EM in Section 4.2. 

Motif occurrences are predicted using a Bayes-optimal classifier that 
has been described previously by Bailey and Elkan (1994). Following the 
ZOOPS model, we predict at most one motif occurrence per sequence in 
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Table 1. Realistic synthetic data: classification results 



Conservation (mean bits/col) 


Deterministic EM 




SEAM 






MITSU 






sSn 


sPPV 


AUC 


sSn 


sPPV 


AUC 


sSn 


sPPV 


AUC 


2.00 


0.84 


0.25 


0.99 


1.00 


1.00 




0.70 


0.74 


0.97 


1.49 


0.26 


0.07 


0.98 


0.93 


0.93 




0.90 


0.97 


1.00 


1.08 


0.02 


0.01 


0.96 


0.49 


0.49 




0.68 


0.77 


0.99 


0.76 


0.00 


0.00 


0.94 


0.09 


0.09 




0.17 


0.19 


0.94 


0.51 


0.00 


0.00 


0.93 


0.06 


0.06 




0.07 


0.08 


0.93 



Note: sSn, j-PPFand AUC for five collections of realistic synthetic data with varying levels of motif conservation. Best results are printed in bold. In these tests, motif discovery 
was carried out only at the known motif width. 



Table 2. Escherichia coli data: classification results 



Conservation(mean bits/col) Deterministic EM SEAM MITSU 



sSn sPPV AUC sSn sPPV AUC sSn sPPV AUC 



'High' (1.36) 0.81 0.22 0.96 0.67 0.67 — 0.54 0.75 0.98 

'Low' (0.78) 0.63 0.41 0.96 0.65 0.65 — 0.57 0.71 0.97 

Overall (1.13) 0.74 0.30 0.96 0.66 0.66 — 0.55 0.73 0.98 



Note: sSn, sPPV a.nd AUC for 20 datasets created using previously characterized E.coli TFBS sequences. Best results are printed in bold. In these tests, motif discovery was 
carried out only at the experimentally determined motif width. 



the cut dataset; the cut heuristic means that more than one occurrence per 
sequence may be predicted when these predictions are mapped back to 
the original dataset. 

4 RESULTS AND DISCUSSION 

Here, we summarize and discuss the results of a number of tests 
that illustrate the advantages of a sEM-based approach for motif 
discovery and the performance advantages of MITSU in particu- 
lar. Algorithmic performance is assessed through mean site-level 
sensitivity (sSn), mean site-level positive predictive value (sPPV) 
and the area under the receiver operating characteristic (ROC) 
curve (AUC). These measures are commonly used to assess the 
performance of motif discovery algorithms, for example, in the 
studies of Hu et al. (2005) and Tompa et al. (2005). Following 
these studies, a predicted motif site is defined as a true-positive 
result if it overlaps the true site by at least a quarter of the motif 
width. 

4.1 Stochastic EM outperforms deterministic EM 

MITSU was evaluated quantitatively using a mixture of realistic 
synthetic and previously characterized real data. Datasets were 
constructed as described previously (Kilpatrick et al., 2013). 
Briefly, five large data collections each consisting of 1000 datasets 
were constructed using synthetic motifs of varying conservation 
and realistic Escherichia coli background sequence extracted from 
the EcoGene database (Rudd, 2000). A sixth data collection con- 
sisting of 20 datasets was constructed using known E.coli TFBS 



sequences extracted from RegulonDB (Gama-Castro et al.,20\\). 
Finally, a data collection consisting of nine datasets was con- 
structed using known TFBS motif sequences from diverse pro- 
karyotic species. These motif sequences were discovered by ChIP 
methods. Background sequences for these datasets were con- 
structed using synthetic data, altering the probability of choosing 
each nucleotide to reflect the species GC-content as required. 
Tables 1-3 summarize the results of the tests on these data collec- 
tions. For comparison, we also include the results of a determin- 
istic EM-based motif discovery algorithm (Kilpatrick et al., 2013) 
and SEAM. AUC results are not available for SEAM, as con- 
structing a ROC curve requires ordering all subsequences accord- 
ing to their probability of being a motif occurrence. This is not 
possible in SEAM as a result of the method of prediction used. 

4. LI Realistic synthetic data Based on the results on realistic 
synthetic data shown in Table 1, we note that sSn and sPPV 
decrease with decreasing motif conservation for all three tested 
algorithms. We have noted this behaviour previously in deter- 
ministic EM (Kilpatrick et al., 2013) and attribute the decrease in 
sSn to fewer sites being predicted overall and the decrease in 
sPPV to the background sites better matching the motif sites 
as conservation decreases, leading to an increase in the number 
of false -positive results. 

We note that, in the majority of tests, the results of MITSU 
outperform those of both the deterministic EM algorithm and 
SEAM, particularly with regard to sSn and sPPV. The increased 
performance at lower levels of motif conservation is particularly 
notable. The success of MITSU is attributable to making fewer. 
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Table 3. Diverse prokaryotic data: classification results 



Conservation (mean bits/col) Deterministic EM SEAM MITSU 



SSn Sppv AUC sSn sPPV AUC sSn sPPV AUG 



0.99 0.75 0.67 0.99 0.86 0.86 — 0.88 0.92 1.00 



Note: sSn, sPPV and AUC for nine datasets created using real prokaryotic data determined through ChIP experiments. Best results are printed in bold. In these tests, motif 
discovery was carried out only at the experimentally determined motif width. 



but more accurate, predictions. The predictions made are gener- 
ally more cautious; previous false -positive predictions are now 
more likely to be classified as true-negative predictions. This sig- 
nificant reduction in the number of false-positive predictions ex- 
plains the large increase in the ^PPU values. 

We also note that the sSn and ^PPU results for the sEM-based 
algorithms are less biased. The results for the deterministic EM 
algorithm, particularly at high levels of motif conservation, are 
skewed towards increasing sSn; that is, fewer false-negative pre- 
dictions were made at the expense of having more false-positive 
predictions. SEAM and MITSU are more unbiased in this 
respect, producing fewer false predictions in general. 

4.1.2 Escherichia coli and prokaryotic ChIP data Tables 2 and 3 
present the results of tests on previously characterized E.coli 
TFBS sequences and TFBS sequences from diverse prokaryotes 
determined by ChIP experiments, respectively. The general trend 
remains the same: both sSn and sPPV decrease with decreasing 
motif conservation. We have reported previously that determin- 
istic EM-based motif discovery achieves better classification re- 
sults on previously characterized E.coli data than could be 
expected given realistic synthetic data of a similar conservation 
(Kilpatrick et al., 2013). Again, we attribute this improvement in 
performance to the differences in motif structure. Whereas the 
conservation of the synthetic motifs used here is independent of 
position, Eisen (2005) notes that real TFBS motifs with low 
mean conservation often have clusters of well-conserved pos- 
itions; we believe that differences in the distribution of high 
and low conservation across true motifs in comparison with syn- 
thetic motifs explains the improvement in performance on real 
data. We note a similar trend here with the results of SEAM and 
MITSU, particularly at lower levels of motif conservation. 

As with the reahstic synthetic data, MITSU is shown to in- 
crease sPPV by making fewer, more accurate, predictions 
(Table 2). We note that the sSn values are decreased to lower 
than the corresponding values from deterministic EM and (to a 
lesser extent) SEAM. This is a side effect of predicting fewer sites 
overall: 'borderline' predictions that may have been classified as 
true-positive results previously are now classed as false-negative 
results owing to the more cautious predictor. However, as with 
the reahstic synthetic data results, we note that the sSn and sPPV 
values for MITSU are now less biased. Although MITSU uses a 
Bayes-optimal classifier for site prediction, the results of the 
E.coli tests here suggest that a better balance between sSn and 
sPPV mdiy be achieved with a different predictor. However, we 
note that the complexity of the computational problem and the 




False positive rate False positive rate 

Fig. 1. ROC curves (plotted for 0 <sFPR< 0.5) for the E.coli TorR 
motif discovered by the deterministic EM algorithm (left) and MITSU 
(right). Curve colour illustrates the threshold of p(Zij=l\Xij,6), from 
highest (red) to lowest (blue) 



wide structural variety of TFBS motifs may mean that it is not 
possible to improve on all measures in all cases. 

MITSU is shown to be particularly effective in cases where 
the deterministic EM-based algorithm returned poor results. 
Figure 1 displays ROC curves for the E.coli TorR motif as dis- 
covered by both the deterministic EM and MITSU algorithms. 
This motif was poorly discovered by the deterministic EM algo- 
rithm (sSn = 0.10, sFFV = 0.03, AUC = 0.83); however, 
MITSU increases performance over all measures (sSn = 0.30, 
sFFV= 0.50, AUC = 0.98). As noted above, the significant im- 
provement in sFFVis attributable to predicting fewer sites over- 
all, reducing the number of false-positive results. In this case, the 
improvement in sSn is a result of an improved motif model, 
which better fits the known occurrences. Sequence logos repre- 
senting the motifs discovered by both algorithms are shown in 
Figure 2. Similar improvements in performance are also seen for 
the E.coli FruR and RscB motifs. 

Table 3 shows that for the diverse prokaryotic motifs, MITSU 
outperforms deterministic EM and SEAM in terms of all three 
performance measures. We note that the increase in sFFV is 
most significant. This result may be of particular interest to 
biologists, as it means that fewer false-positive results are pre- 
dicted: sites which are predicted now are therefore more likely to 
be true TFBS occurrences. As with the E.coli motifs above, we 
notice significant increases in performance for motifs that were 
relatively poorly discovered by deterministic EM, for example, 
the E.coli CRP and RutR motifs and the Bacillus subtilis SpoOA 
motif. 

Further tests were carried out in which the MCOIN heuristic 
was used to determine the most likely motif width from a range 
of plausible widths (±4 bp of the experimentally determined 
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Fig. 2. Sequence logos representing the E.coli TorR motif as discovered 
by the deterministic EM algorithm (top) and MITSU (bottom) 

motif width). When the true motif width is unknown, the per- 
formance of MITSU is decreased shghtly; the overall results on 
the E.coli dataset and the diverse prokaryotic dataset when using 
the MCOIN heuristic (sSn = 0.43, sPPV = 0.68, AUC = 0.97 
and sSn = 0.85, sPPV = 0.88, AUC = 1.00, respectively) show 
that MITSU continues to outperform both deterministic EM 
and SEAM in terms of sPPV and AUC, but decreases in sensi- 
tivity compared with the previous results (Tables 2 and 3). 
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4.2 Stochastic EM escapes local maxima 

One major motivation for the sEM algorithm is the fact that the 
deterministic EM algorithm cannot be guaranteed to converge to 
the global maximum of the likelihood function and may instead 
converge to a saddle point or local maximum of the likelihood 
function. While sEM also cannot be guaranteed to converge 
to the global maximum of the likelihood function, it can be 
demonstrated that the stochastic perturbations of sEM allow 
sEM-based algorithms to escape local maxima, which trap de- 
terministic EM-based algorithms, in a motif discovery context. 

We construct a dataset comprising 10 sequences of 200 nt in 
length, each sequence containing a single occurrence of a per- 
fectly conserved motif of width 8 bp. As before, E.coli inter genie 
sequences extracted from EcoGene were used as background 
positions. Despite the relative simplicity of the dataset, we 
expect that there will be a large number of local maxima in the 
likelihood function, corresponding to patterns that are better 
conserved than the background but less well conserved than 
the motif of interest. 

Energy traces for two runs of both the deterministic EM al- 
gorithm and MITSU are shown in Figure 3. Both algorithms are 
initialized with the same parameter values and allowed to run to 
convergence. Both traces illustrate one of the major differences 
between deterministic and sEM: while each iteration of determin- 
istic EM is guaranteed not to decrease the likehhood, sEM has a 
non-zero probability of accepting new model parameters that 
decrease the likelihood, to escape local maxima of the likelihood 
function. The top trace illustrates a case where deterministic EM 
converges to a local maximum at around G((p)= —0.52. In 
contrast, although sEM spends ~40 iterations around 
G((p)= —0.70, a small jump that decreases energy at iteration 
53 is followed by several iterations, which dramatically increase 
the energy. Using our stopping rule, sEM converges at 
G((p)= —0.14, the energy corresponding to perfect discovery 
of the known motif. The lower trace in Figure 3 shows a case 
where both algorithms converge to G((p)= —0.14. This trace 
illustrates that deterministic EM generally converges faster 



Energy trace 




0 10 20 30 40 

Iteration number 

Fig. 3. Energy traces for two runs of both the deterministic EM algo- 
rithm (blue) and MITSU (red) on a synthetic dataset containing a per- 
fectly conserved motif of width 8 bp. Algorithm convergence is marked 
with 'x' in both cases. We note that the sEM algorithm allows MITSU to 
escape local maxima of the likelihood function, which can trap determin- 
istic EM (top) 

than sEM, which can spend a relatively large number of iter- 
ations exploring lower energies before converging. However, 
we see this slower convergence as a small trade-off in exchange 
for more accurate motif models and binding site predictions, as 
shown in the top energy trace. 

4.3 MITSU successfully discovers multiple motifs in a 
single sequence 

As noted in Section 3.3, the cut heuristic in combination with the 
ZOOPS model allows discovery of multiple motif occurrences 
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Fig. 4. CRP motif sequence logos. From top: logo constructed from the 
24 binding sites contained in the CRP dataset; logo representing the motif 
discovered by MITSU; logo representing the motif discovered by MEME 
when the number of known sites was not provided 



within a single input sequence. We present a proof of principle 
using the well-known CRP dataset, used by Bi (2007), Lawrence 
et al. (1993) and Stormo and Hartzell (1989), among others. 
Briefly, CRP is a prokaryotic transcription factor that is import- 
ant in the regulation of genes involved in energy metabohsm. The 
CRP dataset consists of 18 sequences, each of which is 105 nt in 
length. The dataset contains 24 CRP binding sites determined by 
footprinting experiments or sequence similarity to confirmed 
binding sites; each sequence in the dataset contains one or two 
sites. Each binding site is 22 bp in length. Figure 4 (top) shows 
the CRP motif sequence logo constructed from the 24 binding 
sites in the dataset; we note that the low conservation and 
gapped nature of the CRP motif increases the challenge of com- 
putational discovery. 

We compare MITSU against MEME and assume that the true 
motif width is known; both algorithms are run at this width. 
MITSU was run with the cut length U equal to half the length 
of each input sequence. The results of this test show that MITSU 
predicted 28 binding sites {sSn = ^.l\, sPPV =0.61, 
AUC = 0.99) and successfully predicted both binding sites in 
the CEICG, ARA and LAC sequences. The middle logo in 
Figure 4 represents the motif discovered by MITSU. Based on 
this result, MITSU compares well with MEME, which predicted 
18 binding sites and failed to discover more than one site in a 
sequence using the TCM model when the total number of sites 
was not provided (sSn = 0.71, sPPV = 0.94). Fourteen of the 
sites predicted by MEME were also predicted by MITSU. The 
bottom logo in Figure 4 represents the motif discovered by 
MEME when the number of known sites was not provided. 
This motif is shifted by 3 bp compared with the motif constructed 
from the known binding sites. When the total number of sites 
was used as additional information, MEME predicted 24 binding 
sites and successfully predicted both binding sites in the CEICG, 
DEOP2 and MALK sequences (sSn = sPPV = 0.83). Sixteen of 
the sites predicted by MEME were also predicted by MITSU. 

Comparing the sequence logos representing the motifs dis- 
covered by MITSU and MEME shown in Figure 4, we note 
that the positions in the motif discovered by MITSU are gener- 
ally underweighted compared with the known motif and that the 



positions in the motif discovered by MEME are generally over- 
weighted. This difference in weighting is due to the number of 
sites predicted by each algorithm. Both algorithms return the 
same number of true-positive predictions; the number of false- 
negative predictions is also equal, leading to identical sSn results. 
MITSU predicts more false-positive sites than MEME, which 
leads to an underweighting of the positions in the model dis- 
covered by MITSU compared with that discovered by MEME. 
This also provides an explanation for the decreased ^PPK result 
(0.61 versus 0.94, respectively). While there is room for improve- 
ment, the cutting heuristic is shown to successfully reproduce the 
TCM model in principle without additional heuristic optimiza- 
tions to improve performance. 



5 CONCLUSION 

Computational discovery of TFBS motifs remains an important 
and challenging problem in bioinformatics. MITSU is a novel 
algorithm for motif discovery, based on sEM. MITSU has a 
clear advantage over deterministic algorithms in that it is less 
likely to converge to insignificant local maxima of the likelihood 
function owing to the sEM algorithm, improving results. We 
show that the sEM algorithm allows MITSU to escape these 
local maxima and converge to models with higher energies. 
MITSU also has advantages over existing sEM-based motif dis- 
covery algorithms as it is unconstrained with regard to the dis- 
tribution of motif occurrences within the input dataset and 
incorporates useful features commonly found in modern motif 
discovery algorithms, such as automatic determination of motif 
width. 

Results of tests on several collections of reahstic synthetic data 
and two collections of previously characterized prokaryotic data 
show that MITSU consistently outperforms deterministic EM 
and the SEAM algorithm for motif discovery in terms of site- 
level positive predictive value and generally performs at least as 
well in terms of overall correctness of results, based on ROC 
analysis. We note that the results returned by MITSU also 
often increase site-level sensitivity. Using the well-known CRP 
dataset, we demonstrate that MITSU combines a cut heuristic 
with the ZOOPS model to effectively reproduce a TCM model 
without the compromise of additional 'smoothing' steps. 

Future work will implement probabilistic (or 'soft') erasing to 
discover multiple different motifs within a single dataset and will 
investigate exploiting the Metropohs accept/reject mechanism to 
incorporate relevant model-level biological knowledge. Such 
heuristics will be important in further optimizing performance, 
as is the case for estabhshed motif discovery algorithms. 
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